arXiv:1502.03655v2 [stat.CO] 11 Sep 2015 


Newton-based maximum likelihood estimation 
in nonlinear state space models 

Manon Kok*, Johan Dahlint Thomas B. Schon^ and Adrian Wills^ 

September 14, 2015 


Abstract 

Maximum likelihood (ML) estimation using Newton’s method in nonlinear state space models (SSMs) 
is a challenging problem due to the analytical intractability of the log-likelihood and its gradient 
and Hessian. We estimate the gradient and Hessian using Fisher’s identity in combination with a 
smoothing algorithm. We explore two approximations of the log-likelihood and of the solution of 
the smoothing problem. The first is a linearization approximation which is computationally cheap, 
but the accuracy typically varies between models. The second is a sampling approximation which is 
asymptotically valid for any SSM but is more computationally costly. We demonstrate our approach 
for ML parameter estimation on simulated data from two different SSMs with encouraging results. 
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1 Introduction 


Maximum likelihood (ML) parameter estimation is a ubiquitous problem in control and system identifi¬ 
cation, see e.g. Ljung [1999| for an overview. We focus on ML estimation in nonlinear state space models 
(SSMs), 


Xt+i = fg{xt,Vt), 
yt = 9e{xt,et), 


( 1 ) 


where Xt and yt denote the latent state variable and the measurement at time t, respectively. The 
functions /e(-) and <?§(•) denote the dynamic and the measurement model, respectively, including the 
noise terms denoted by vt and e*. The ML estimate of the static parameter vector 0 G 0 C IRp is obtained 
by solving 


9 ml = argmax£e(2/i:jv), 
e 


( 2 ) 


where ie{yi:N) — logP6i(2/i:Af) denotes the log-likelihood function and yi:^' = {yi,... ,yN}- 

In this paper, we aim to solve (§ using Newton methods [Nocedal and Wright[ |200^ . These enjoy 
quadratic convergence rates but require estimates of the log-likelihood and its gradient and Hessian. 
For linear Gaussian SSMs, we can compute these quantities exactly by the use of a Kalman filter (KF; 
Kalman [1960 ) and by using a Kalman smoother together with Fisher’s identity Fisher 1925|. This 


approach has previously been explored by e.g. |Segal and Weinstein| [1989| . An alternative approach is 

. However, neither of 


to compute the gradient recursively via the sensitivity derivatives Astrom |l980 


these approaches can be applied for general SSMs as they require us to solve the analytically intractable 
state estimation problem for Q. 

to 


The main contribution of this paper is an extension of the results by Segal and Weinstein 1989 


general SSMs in order to solve ([^. To this end, we explore two approximations of the log-likelihood and 
the solution of the smoothing problem, using linearization and using sampling methods. The linearization 


approximation estimates the log-likelihood using an extended Kalman filter (EKF), see e.g. Gustafsson 


2012 . The smoothing problem is solved using Gauss-Newton optimization. The approximation 


is based on particle filtering and smoothing Doucet and Johansen 2011 


The main differences between these two approximations are the accuracy of the estimates and the 
computational complexity. Sampling methods provide asymptotically consistent estimates, but at a 
high computational cost. It is therefore beneficial to investigate the linearization approximation, which 
provides estimates at a much smaller computational cost. However, the accuracy of the linearization 
typically varies between different models, requiring evaluation before they can be applied. 
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The problem of ML estimation in nonlinear SSMs is widely considered in literature. One method 


used by e.g. Kok and Schdn 2014 approximates the log-likelihood in ([^ using an EKF. The ML es¬ 
timation problem is subsequently solved using quasi-Newton optimization for which the gradients are 
approximated using finite differences. This results in a rapidly increasing computational complexity as 
the number of parameters grows. Other approaches are based on gradient ascent algorithms together 


with particle methods, see e.g. Poyiadjis et al. 2011 and Doucet et al. 2013 . These approaches typi¬ 
cally require the user to pre-define step sizes, which can be challenging in problems with a highly skewed 
log-likelihood. Lastly, gradient-free methods based on simultaneous perturbation stochastic approxi¬ 


mation [?], Gaussian processes optimization Dahlin and Lindsten 2014 and expectation maximiza¬ 


tion Schon et al. 2011, Kokkala et al. 2014 can be used. However, these methods typically do not 


enjoy the quadratic convergence rate of Newton methods, see e.g. Poyiadjis et al. 2011 for a discussion. 


2 Strategies for computing derivatives of the log-likelihood 

A Newton algorithm to solve ([^ iterates the update 




-If 


G{0k) 


G{dk) = m^°SPe{yi-.N)\g^g^, 

'H{0k) = Ey^.^ ^logPe(yi:Ar)|e^e^ 


(3) 


where Sk denotes a step length determined for instance using a line search algorithm Nocedal and Wright 


2006 or using an update based on stochastic approximation fPoyiadjis et al. 2011| . Here, we introduce 
the notation G{dk) and T-LiOk) for the gradient and the Hessian of the log-likelihood, respectively. In 
Algorithm]^ we present the full procedure for solving the ML problem § using <§■ 

To make use of Algorithm we require estimates of the gradient and the Hessian. We estimate the 
gradient via Fisher’s identity. 


G{d) = j §g\ogpe{xi:N,yi-.N)pe{xi-.N\yi-N)dxi.,N- 


(4) 


Here, computation of the gradient of the log-likelihood amounts to a marginalization of the gradient of the 
complete data log-likelihood with respect to the states. As the states are unknown, this marginalization 
cannot be carried out in closed form, which is the reason why the gradients cannot be computed exactly. 
The complete data likelihood for an SSM is given by 

N-l N 

Pe{xi,N,yi:N) =Pe{xi) f9{xt+i\xt)Y\_90{yt\xt), (5) 
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Algorithm 1 Newton method for ML parameter estimation 
Inputs: Initial parameter 9q, maximum no. iterations K. 
Outputs: ML parameter estimate 0 ml- 


1: Set k = 0 

2: while exit condition is not satisfied do 

a: Run an algorithm to estimate the log-likelihood i{0k), its gradient G{dk) and its Hessian 'H{9k). 
b: Determine using e.g. a line search algorithm or a stochastic schedule, 
c: Apply the Newton update ^ to obtain 9k+i. 
d: Set k = k + 1. 

end while 

3: Set 0 ml = 0fe- 


where pe{xi), feixt+ilxt) and ge{yt\xt) are given by 0. In this paper, we assume that we can compute 
the gradient of the logarithm of this expression with respect to the parameter vector. By inserting 0 
into Q, we can make use of the Markov property of the model to obtain 


G{9k) = Y^ Uedxt+i:t)peA^t+i-.t\yi:N)dxt+i-.t, ( 6 ) 

t=i c__/ 

=gt{ek) 

Ce^{xt+i-.t) = :§g\ogfs{xt+i\xt)\g^g^ + ^^oggg{yt\xt)\g^g^, 


where we interpret fg{xi\xo) = p{xi) for brevity. The remaining problem is to estimate the two-step 
joint smoothing distribution (3^t+i:t |0i:Af) and insert it into the expression. 

Furthermore, the Hessian can be approximated using the gradient estimates according to [Segal and| 


Weinstein 1989 and Meilijson 1989 . The estimator is given by 


1 T ^ 

n{9k) = - [e(0fc)] [^(^fe)] - 




(7) 


which is a consistent estimator of the expected information matrix. That is, we have that Hifi*) —^ 'H{9*) 
as N —>■ oo for the true parameters 0* if the gradient estimate G{9) is consistent. The advantage of this 
estimator is that Hessian estimates are obtained as a by-product from the estimation of the gradient. 

Both estimators in 0 and Q, require estimates of the intractable two-step smoothing distribution. 
In the two subsequent sections, we discuss the details of how to make use of the linearization and sampling 
approximations to estimate this smoothing distribution. 
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3 Linearization approximation 


To make use of linearization approximations to estimate the gradient, we treat ([^ as an expected value 
of the form 


N 


Si^k) ='^'^9^ ^9Axt+l-.t)\yi:N 




( 8 ) 


In Section |3.1[ we first consider the linear case and recapitulate the calculation in [Segal and Weinstein| 


1989 . These results are extended to nonlinear SSMs with Gaussian additive noise in Section 3.2 


3.1 Linear models with additive Gaussian noise 

The linear Gaussian SSM is given by the following special case of 0. 


Xt+i = F{9)xt + vt{9), vt(9) A/'(0, Q(9)), 

yt = G{9)xt + et{9), et{9) ~ AA(0, R{9)). 


(9) 


For notational brevity, we assume that the initial state is distributed as xi ^ JV{ytx, Pi) and is indepen- 
dent of the parameters 9. For this model, we can express the ML problem as 


N 

9^^ = a.rgma.x'^logAf (yt-,yt\t-i{d), St 


( 10 ) 


where it is possible to compute the predictive likelihood y^t-iiS) and its covariance St{9) using a KF. 

To solve (10) using Algorithm the gradient and Hessian of the log-likelihood need to be computed. 
Using ®, we first note that the complete data log-likelihood can be expressed as 


\ogpg{xi..N,yi-.N) = -^^logdetQ - ^logdeti?- 
\\\xi - Ata;||p-i - I logdetPi- 

N-l N 

I \\xt+i - ~ (11) 

t=i t=i 

where the explicit dependency on 9 has been omitted for notational simplicity. The gradient of the 
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log-likelihood Q can then be written as 


m = - ^ dt (Q-i §) - f Tr (i?-i If) - 


I Tr [ 5^(Tt|jv + Xt\NxJ\N) 


dQ-^ 

ae 


t=2 

N 


I Tr ^(Pt-i|iv + 5it-i|ivi?J-i|iv) ^ ) + 

\t^2 

Tr I + »t-i|iv2:t|iv) ^ ] + 


,T 3i{-i 


E T dR~^G'^ 1 \ ^ 

Vt ga xt\N — 2 /_^yt ae 2/f 


I Tr I ^^(-Pt|iv + Xt\Nxl\M)- 


( 12 ) 


where ir^ijv and Pt|jv denote the smoothed state estimates and their covariances, respectively. The term 
Pt-i,t\N denotes the covariance between the states Xt-i and Xt- These quantities can be computed using 


a Kalman smoother, see e.g. Rauch et al. 1965 


3.2 Nonlinear models with additive Gaussian noise 


A slightly more general SSM is given by 


xt+i = fe{xt) + vt{9), vtiO) ~ A/'(0, Q{0)), 

yt = gaixt) + et{d), et{9) ^ A/'(0, R{0)). 


(13) 


For this model, the ML problem assumes the same form as in Section [3T] but the predictive likelihood 
yt\t-i(()) and its covariance St{9) cannot be computed exactly. Instead, we replace them with estimates 
obtained from an EKF. 

To compute the gradient and the Hessian of the log-likelihood, assume for a moment that we can 
compute the complete data log-likelihood exactly. The gradient ([^ can then be computed exactly in 
two special cases: 


1. When part of the SSM (13) is linear and the parameters only enter in the linear part of the model. 


In this case, the gradients reduce to their respective linear parts in (12). 


2. When the expectation in ([^ can be computed exactly even though the model is nonlinear, for 
example in the case of quadratic terms. 

The assumption that the complete data log-likelihood can be computed exactly is clearly not true for 
nonlinear SSMs. However, good estimates of Xt\N can be obtained by solving the optimization problem 


Xt\N = argmaxpg^(a:i:Ar,yi:Ar). 

Xl-N 


( 14 ) 
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Algorithm 2 Computation of the log-likelihood and its derivatives using linearization approximation 
Inputs: A parameter estimate 9k 

Outputs: An estimate of the log-likelihood £{9k) and its gradient Q{dk) and Hessian ^(0^). 


1: Run an EKF using Ok to obtain £{9k). 


2: Solve the smoothing problem (14). 

a: Initialize using the state estimates from the EKF. 

b: while exit condition is not satisfied do 

i: Compute the gradient and approximate the Hessian of the smoothing problem 

using (15). 

ii: Use a Gauss-Newton update similar to (§ to obtain 
iii: Set i = i + 1. 


end while 


c: Set Xi-jki^jkf — xf 


j+i 


N\N- 


3: Compute Pt^t\N and Pt-i.t|Ar using ( |15D . 

4: Use ^ to estimate the gradient G{dk) and use ([^ to determine the Hessian 'H{6k). 


This is a nonlinear least-squares problem and can be solved efficiently using a standard Gauss-Newton 
solver due to its inherent sparsity. The state covariances Pt\N and Pt-i^t\N can be approximated from 
the inverse of the approximate Hessian of the complete data log-likelihood 


Pl-.N,l-.N\N{Xt\N) ^ { ff{Xt\N) J{Xt\N) 


-1 


(15) 


JiXt\N) = ^P0^ixi:N,yi:N) 


Xi.N—^l'.N 


where Pi:N. 1 :N\n represents the smoothed covariance matrix representing the covariance between all 
states. Since we are only interested in (which is short notation for Pt,t\N) and Pt-i^t\N, we are 
only interested in a few components of Pi:n, 1 :N\n- Hence, it is not necessary to form an explicit inverse, 
which would be intractable for large N. 


4 Sampling approximation 


An alternative approximation of the log-likelihood and the solution to the smoothing problem uses 
sampling methods. The approximation is based on particle filtering and smoothing and can be applied 
to more general nonlinear SSMs than the special cases introduced in Section 


Particle methods are a specific instance of sequential Monte Carlo (SMC; Doucet and Johansen 


2011) methods, when this family of algorithms is applied on general SSMs. The particle filter (PF) is 
a combination of sequential importance sampling and resampling applied to estimate the states of an 
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Algorithm 3 Computation of the log-likelihood and its derivatives using sampling approximation 
Inputs: Parameter estimate 0k and no. particles M. 

Outputs: Estimate of gradient G{0k) and Hessian ^{Ok). 


1: Run a bPF using Ok to estimate the particle system. 

(all operations are carried out over i,j = l,...,M) 

a: Sample ^ pg^{xi) and compute by Step iii. 
b: for t = 2 to A do 

(i) / (i) 

i: Resampling: sample a new ancestor from a multinomial distribution with V(a). 


J) 


ii: Propagation: Sample ~. 

iii: Weighting: Calculate which by normalisation (over i) gives 

endfor 

2: Run a smoothing algorithm, e.g. FL or FFBSi. 

3: Use (19) or (21) together with 0 to estimate the gradient QiOk) and the Hessian 'H{6k)- 


SSM. For example, the two-step smoothing distribution required in § can be estimated by 


M 


Pe{dxt+i-.t\yi-.N) = y^w](’S (i) (dxt+i:*), 

2=1 


(16) 


where and x[^'^ denote the weights and locations of the particles obtained by the PF. The particle 


system Pm = required to compute (16) is obtained by an iterative algorithm with three 

steps: resampling, propagation and weighting. The most commonly used PF is the bootstrap particle 
filter (bPF), which is summarized in Step 1 of Algorithm]^ 

The estimator in (16) can be directly inserted into 0 to estimate the gradient of the log-likelihood. 
However, this typically results in estimates suffering from high variance due to the problem of path 
degeneracy. This is a result of the successive resamplings of the particle system (Step i in Algorithm]^ 
that collapse the particle trajectories. Hence, all particles share a common ancestor, i.e. effectively we 
have M « 1. 

Alternative methods instead rely on particle smoothing to estimate the two-step smoothing distri¬ 
bution, which results in estimators with better accuracy but with an increase in the computational 


complexity. In this paper, we consider two different particle smoothers: the fixed-lag (FL; Kitagawa and 


Sato 2001) smoother and the forward filter backward simulator (FFBSi). There are numerous other 


alternatives and we refer to Lindsten and Schon 2013 for a survey of the current state-of-the-art. 
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4.1 Fixed-lag smoothing 


The FL smoother relies upon the assumption that the SSM is mixing fast and forgets about its past within 
a few time steps. More specifically, this means that we can approximate the two-step joint smoothing 
distribution by 


pe,(dxt+i:t|?/i:Ar) « (dxt+i,* 


(17) 


where k* = max(A^, t -|- 1 -|- A) for some fixed-lag 0 < A < iV. The resulting estimator has the form 


M 


P0, 


.{dxt+l-.t\yi-.N) (dxt+l-.t), 




(18) 


where denotes the ancestor at time t of particle x^} ■ The gradient can subsequently be estimated 
by inserting (18) into ^ to obtain 


N M 


Gi.^k) = ■ ( 19 ) 

t=i i=i 

The implementation of this expression is straightforward and is summarized in Algorithm See Dahlin 


2014 for the complete derivation and algorithm of the FL smoother. The FL smoother retains the 


computational complexity of the PF, i.e. 0{NM). Furthermore, it enjoys improved statistical properties 
under some general mixing assumptions of the SSM. A drawback with the FL smoother is a persisting 
bias that does not vanish in the limit when M —>■ oo. 


4.2 Forward filter backward simulator 


The second smoother that we consider is the FFBSi algorithm, which enjoys even better statistical 
properties than the FL smoother but with a computational complexity proportional to 0{NMM), 
where M denotes the number of particles in the backward sweep. The estimates obtained with the 
FFBSi smoother are asymptotically consistent and therefore unbiased in the limit when M,M —)■ oo. 

The FFBSi algorithm can be seen as the particle approximation of the Rauch-Tung-Striebel (RTS) 
smoother. The gradient can be estimated using a backward sweep after a forward pass using bPF in 
AlgorithmIn the backward sweep, we draw M trajectories by rejection sampling from the 

backward kernel given by 


M 

Bt{dxt\xt+i) = ^ 

i-1 










(dxt). 


( 20 ) 
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where Tm generated by the forward pass. The gradient can subsequently be estimated as 


1 


N M 


t=l i=l 


( 21 ) 


To decrease the computational time, we make use of the early stopping rule discussed by ?. Here, we 
sample the first Miimit backward trajectories using rejection sampling and the remaining using standard 
FFBSi. This results in a computationally efficient and accurate smoothing algorithm. See Algorithm 6 


Lindsten and Schon 2013 for the complete algorithm and a discussion of its statistical properties. 


5 Simulation results 


We evaluate the proposed methods by considering two different SSMs. For both models, we compare the 
estimates obtained using four different methods: 

1. ALG2: The Newton-based optimization in Algorithm in combination with estimates of the 
log-likelihood and its derivatives from Algorithm 

2. ALG3FL: The Newton-based optimization in Algorithm [l] in combination with estimates of 

the log-likelihood and its derivatives from Algorithm using an FL smoother, with A = 12 and 
M = 2 000. We use the sequence of step sizes given by Sk = as suggested by 

20n . 

3. ALGSFFBSi: Similar to ALG3FL but instead of an FL smoother we use an FFBSi smoother, 
with M = 2 000, M = 100 and Mumit = 10. 

4. NUM: A quasi-Newton algorithm, which computes the log-likelihood using an EKF as in Sec¬ 
tion |3.2[ but the gradients are found using finite-differences of the approximate log-likelihood 
instead. 


Poyiadjis et al. 


The latter method is included in the comparison because it is commonly used in ML parameter 


estimation approaches where the log-likelihood is approximated using linearization, see e.g. Kok and 


Schon 2014 for a concrete example. It is computationally expensive for large dimensions of the parameter 


vector, but is cheap for the low dimensional parameter vectors we consider here. Source code for the 
simulated examples is available via the first author’s homepagej^ 
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Figure 1: Comparison of the different methods for the SSM (22). Upper: The estimates of 6i and 02 


using 100 data sets. Lower: trace plots of the optimization methods for one of the data sets for 9i (left) 
and 6*2 (right). 
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5.1 Parameters in the linear part of the SSM 


The first model that we consider is given by 


Xt+i = arctanxi + Vt, Vt A/'(0,1), 

yt = SiXt + O 2 + et, et ~ A/'(0,0.1^). 


( 22 ) 


The unknown parameter vector is 6* = {0i,02}. The model (22) has nonlinear dynamics, but the 
measurement equation is linear. This corresponds to a model of type 1 as discussed in Section |3.2[ 
As the parameters are located in the linear measurement equation, the expressions for the gradient ([^ 
and Hessian 0 in Algorithmare equal to their linear counterparts in ( [T^ . 

We simulate 100 data sets each consisting of = 1 000 samples and true parameters 9* = {0.5,0.3} 
to compare the accuracy of the proposed methods. All methods are initialized in 9^ = (0.7, 0.0}. The 
parameter estimates from the four methods are presented in the upper plot in Fig.[2 The bias and mean 
square errors (MSEs) as compared to 9* are also represented in Table Note that in the model (22), 
positive and negative 9k^i are equally likely. Hence, without loss of generality we mirror all negative 
solutions to the positive plane. 

To illustrate the convergence of the different methods, the parameter estimates as a function of the 
iterations in the Newton method are shown in the lower plot in Fig. As can be seen, all four methods 
converge, but ALG2 and NUM require far less iterations than the ALG3FL and ALG3FFBSi. 


For the SSM (22) it can be concluded that ALG2 outperforms the other methods both in terms of the 


bias and the MSE. The reason is that, as argued in Section |3i^ accurate gradient and Hessian estimates 
can be obtained for this model because of its structure. Note that ALG2 and NUM not only require the 
least computational time per iteration but also require far less iterations to converge than ALF3FL and 
ALG3FFBSi, making them by far the most computationally efficient algorithms. 


5.2 Parameters in the nonlinear part of the SSM 

The second model that we consider is given by 


Xt+i — 9i aTct&nxt + Vt, vt^Af{0,l) 
yt = 92Xt + et, et ~ A/'(0,0.1^), 


(23) 


where 9i scales the current state in the nonlinear state dynamics. We apply the same evaluation procedure 
as for the previous model but with 9* = {0.7,0.5} and 9o = {0.5, 0.7}. 

Note that for this case, the gradients for ALG2 can not be computed analytically. Denoting 9k^i the 

^http://users.isy.liu.se/en/rt/manko/ 
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Figure 2: Comparison of the different methods for the SSM (231. Upper: The estimates of 6 *i and 6*2 


using 100 data sets. Lower: trace plots of the optimization methods for one of the data sets for 0i (left) 
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Table 1: The bias, MSE as compared to 0* and the computational time averaged over 100 data sets of 
N = 1000 for the SSMs (221 (model 1) and (23) (model 2). The boldface font indicates the smallest 
MSE and bias. 



Alg. 

Bias (-10-4) 

MSE (-10-4) 

Time 

9i 

02 

9i 

02 

(sec/iter) 

t-H 

ALG2 

10 

10 

1 

10 

0.81 


ALG3FL 

38 

-214 

2 

16 

4 

O 

ALG3FFBSi 

31 

53 

1 

11 

19 

§ 

NUM 

-4 

48 

1 

10 

0.19 


ALG2 

284 

-55 

28 

2 

2.73 


ALG3FL 

53 

35 

24 

2 

5 

O 

ALG3FFBSi 

55 

31 

24 

2 

22 


NUM 

31 

-27 

23 

1 

0.19 


estimate of 6 i at the fc**' iteration, we approximate the gradient with respect to 9k^i by 


N 


Gi0k,i) ^^logp0^,{xt\xt-i)\yi-.N 


N 


= ^ - I {xt-9k,la^ctanxt-i) \yv. 

t^2 

N 

(aitlAT - dfc.i arctanait_i|7v) ■ 


(24) 


Due to the necessity of the approximation (24), we expect ALG2 to perform worse for the SSM (^23| as 


compared to (22). This is confirmed by the results shown in Fig. and summarized in Table For this 

model, NUM outperforms the other methods both in bias, MSE and computational time. 


6 Conclusions 

We have considered the problem of ML parameter estimation in nonlinear SSMs using Newton methods. 
We determine the gradient and Hessian of the log-likelihood using Fisher’s identity in combination with 
an algorithm to obtain smoothed state estimates. We have applied this method to simulated data from 
two nonlinear SSMs. The first SSM has nonlinear dynamics, but the measurement equation is linear. 
The parameters only enter the measurement equation. Because of this, accurate parameter estimates 
are obtained using linearization approximations of the log-likelihood and its gradient and Hessian. In 
the second SSM, however, the parameters enter in both the linear and in the nonlinear part of the SSM. 
Because of this, the linearization approximations of the gradient and Hessian are of poor quality, leading 
to worse parameter estimates. For this SSM, the methods based on sampling approximations perform 
considerably better. However, a computationally cheap quasi-Newton algorithm which computes the log- 
likelihood using an EKF and finds the gradients using finite-differences outperforms the other methods 
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on this example. Note that this algorithm is only computationally cheap for small dimensions of the 
parameter vector. Furthermore, it highly depends on the quality of the EKF state estimates. 

In future work, we would like to study the quality of the estimates for a wider range of nonlinear 
models. It would also be interesting to compare the method introduced in this work to a solution based 
on expectation maximization. Furthermore, solving optimization problems by making use of the noisy 
estimates of the gradient and the Hessian that are provided by the particle smoother is an interesting 
and probably fruitful direction for future work. 
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